An Intelligent Multilingual Information Browsing and Retrieval System Using Information Extraction
Abstract
In this paper, we describe our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources. The system takes advantage of information extraction (IE) technology in novel ways to improve the accuracy of cross-linguistic retrieval and to provide innovative methods for browsing and exploring multilingual document collections. The system indexes texts in different languages (e.g., English and Japanese) and allows the users to retrieve relevant texts in their native language (e.g., English). The retrieved text is then presented to the users with proper names and specialized domain terms translated and hyperlinked. Moreover, the system allows interactive information discovery from a multilingual document collection.

1 Introduction

More and more multilingual information is available on-line every day. The World Wide Web (WWW), for example, is becoming a vast depository of multilingual information. However, monolingual users can currently access information only in their native language. For example, it is not easy for a monolingual English speaker to locate necessary information written in Japanese. The users would not know the query terms in Japanese even if the search engine accepts Japanese queries. In addition, even when the users locate a possibly relevant text in Japanese, they will have little idea about what is in the text. Outputs of off-the-shelf machine translation (MT) systems are often of low quality, and even "high-end" MT systems have problems, particularly in translating proper names and specialized domain terms, which often contain the most critical information to the users.

In this paper, we describe our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources.
The system takes advantage of information extraction (IE) technology in novel ways to improve the accuracy of cross-linguistic retrieval and to provide innovative methods for browsing and exploring multilingual document collections. The system indexes texts in different languages (e.g., English and Japanese) and allows the users to retrieve relevant texts in their native language (e.g., English). The retrieved text is then presented to the users with proper names and specialized domain terms translated and hyperlinked. The system also allows the users to browse, in their native language, and discover information buried in the database derived from the entire document collection.

2 System Description

The system consists of the Indexing Module, the Client Module, the Term Translation Module, and the Web Crawler. The Indexing Module creates and loads indices into a database, while the Client Module allows browsing and retrieval of information in the database through a Web browser-based graphical user interface (GUI). The Term Translation Module is bi-directional; it dynamically translates user queries into target foreign languages and the indexed terms in retrieved documents into the user's native language. The Web Crawler can be used to add textual information from the WWW; it fetches pages from user-specified Web sites at specified intervals, and queues them up for the Indexing Module to ingest regularly.

For our current application, the system indexes names of people, entities, and locations, and scientific and technical (S&T) terms in both English and Japanese texts, and allows the user to query and browse the database in English. When Japanese texts are retrieved, indexed terms are translated into English. This system is designed to expand to other lan-
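
The interplay between the Indexing Module and the bi-directional Term Translation Module can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the system's actual implementation: the names `TermIndex`, `gloss`, `EN_TO_JA`, and `JA_TO_EN` are hypothetical, the two-entry lexicon is invented for the example, and the indexed terms are supplied by hand where the real system would obtain them from information extraction.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical toy lexicons (not from the paper), one per translation direction.
EN_TO_JA = {"tokyo": "東京", "semiconductor": "半導体"}
JA_TO_EN = {"東京": "Tokyo", "半導体": "semiconductor"}


@dataclass
class TermIndex:
    """Maps each indexed term to the IDs of the documents that contain it."""
    postings: dict = field(default_factory=lambda: defaultdict(set))
    docs: dict = field(default_factory=dict)

    def add(self, doc_id, text, terms):
        # 'terms' stands in for the names and domain terms an IE step would extract.
        self.docs[doc_id] = text
        for term in terms:
            self.postings[term.lower()].add(doc_id)

    def search(self, english_query):
        # Query direction: look up the English term and, if known, its translation.
        query = english_query.lower()
        candidates = [query]
        if query in EN_TO_JA:
            candidates.append(EN_TO_JA[query])
        hits = set()
        for term in candidates:
            hits |= self.postings.get(term, set())
        return sorted(hits)


def gloss(text):
    # Presentation direction: annotate known Japanese terms with English glosses.
    for ja, en in JA_TO_EN.items():
        text = text.replace(ja, f"{ja} [{en}]")
    return text


index = TermIndex()
index.add("en-1", "Tokyo hosts a semiconductor expo.", ["Tokyo", "semiconductor"])
index.add("ja-1", "東京で半導体の展示会が開かれる。", ["東京", "半導体"])

print(index.search("Tokyo"))       # ['en-1', 'ja-1']
print(gloss(index.docs["ja-1"]))   # 東京 [Tokyo]で半導体 [semiconductor]の展示会が開かれる。
```

In this sketch a single English query reaches documents in both languages, and the retrieved Japanese text is shown with its indexed terms glossed in English. A plain string replacement is used for the glossing step only to keep the example short.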
Similar resources
Intelligent Agent-based Multilingual Information Retrieval System
The goal of this work is to develop an Open Agent Architecture for Multilingual information retrieval from Relational Database. The query for information retrieval can be given in plain Hindi or Malayalam; two prominent regional languages of India. The system supports distributed processing of user requests through collaborating agents. Natural language processing techniques are used for meanin...
Supporting Multilingual Internet Searching and Browsing
The amount of non-English information has proliferated rapidly in recent years. The broad diversity of the multilingual content presents a substantial research challenge in the field of knowledge discovery and information retrieval. Therefore there is an increased interest in the development of multilingual systems to support information sharing across languages. The goal of this dissertation i...
Terminology Retrieval: Towards a Synergy between Thesaurus and Free Text Searching
Multilingual Information Retrieval usually forces a choice between free-text indexing and indexing by means of a multilingual thesaurus. However, since they share the same objectives, synergy between both approaches is possible. This paper shows a retrieval framework that makes use of terminological information in free-text indexing. The Automatic Terminology Extraction task, which is used for thes...
Website Term Browser: Overcoming language barriers in text retrieval
Current search systems fail to satisfy users when the relevant information is written in a foreign language; when the user is not aware of the relevant, perhaps specialized, terminology for a given topic; or when the user need is fuzzy and requires assisted search once inside an appropriate web portal. This paper describes an interactive multilingual search system that alleviates such limitation...
Exploiting a Multilingual Web-based Encyclopedia for Bilingual Terminology Extraction
Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article seeks to explore and to exploit the idea of using multilingual web-based encyclopedias such as Wikipedia as comparable corpora for bilingual terminology e...